Use a GPU in an interactive session#
List of partitions#
This shows the list of partitions, along with their nodes (NODELIST) and GPU names (GRES):
sinfo -o "%.10P %.5a %.10l %.6D %.6t %.20N %.10G"
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST GRES
compute* up 3-00:00:00 1 drain* scs0123 (null)
compute* up 3-00:00:00 2 down* scs[0022,0050] (null)
compute* up 3-00:00:00 36 mix scs[0007,0009-0010,0 (null)
compute* up 3-00:00:00 35 alloc scs[0001-0006,0008,0 (null)
compute* up 3-00:00:00 48 idle scs[0049,0051-0062,0 (null)
compute* up 3-00:00:00 1 down scs0100 (null)
developmen up 30:00 1 drain* scs0123 (null)
developmen up 30:00 2 down* scs[0022,0050] (null)
developmen up 30:00 36 mix scs[0007,0009-0010,0 (null)
developmen up 30:00 35 alloc scs[0001-0006,0008,0 (null)
developmen up 30:00 48 idle scs[0049,0051-0062,0 (null)
developmen up 30:00 1 down scs0100 (null)
gpu up 2-00:00:00 1 mix scs2003 gpu:v100:2
gpu up 2-00:00:00 2 alloc scs[2001-2002] gpu:v100:2
gpu up 2-00:00:00 1 idle scs2004 gpu:v100:2
accel_ai up 2-00:00:00 2 mix scs[2041,2043] gpu:a100:8
accel_ai up 2-00:00:00 3 idle scs[2042,2044-2045] gpu:a100:8
accel_ai_d up 2:00:00 2 mix scs[2041,2043] gpu:a100:8
accel_ai_d up 2:00:00 3 idle scs[2042,2044-2045] gpu:a100:8
accel_ai_m up 12:00:00 1 idle scs2046 gpu:1g.5gb
s_highmem_ up 3-00:00:00 2 idle scs[0151-0152] (null)
s_compute_ up 3-00:00:00 1 mix scs3001 (null)
s_compute_ up 3-00:00:00 1 idle scs3003 (null)
s_compute_ up 1:00:00 1 mix scs3001 (null)
s_compute_ up 1:00:00 1 idle scs3003 (null)
s_gpu_eng up 2-00:00:00 1 idle scs2021 gpu:v100:4
In this example I will use:
* PARTITION: accel_ai (because I have access to it).
* Go here: https://scw.bangor.ac.uk/en/projects/memberships/ to check your memberships.
* Make sure the STATE is idle or mix, not drain or down.
* Here, each accel_ai node has 8 NVIDIA A100 40 GB GPUs.
* We can use any of the idle nodes scs[2042,2044-2045].
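The checks in the list above (a GPU entry in GRES, and a STATE of idle or mix) can be sketched in Python. This is only an illustration: the sample rows are copied from the sinfo listing above, and the helper name `usable_gpu_partitions` is hypothetical, not part of Slurm.

```python
# Sketch: pick out GPU partitions with usable nodes from `sinfo` text.
# SAMPLE holds a few rows copied from the listing above; on the cluster
# you would feed in the real output of the sinfo command instead.

SAMPLE = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST GRES
compute* up 3-00:00:00 48 idle scs[0049,0051-0062,0 (null)
gpu up 2-00:00:00 2 alloc scs[2001-2002] gpu:v100:2
gpu up 2-00:00:00 1 idle scs2004 gpu:v100:2
accel_ai up 2-00:00:00 2 mix scs[2041,2043] gpu:a100:8
accel_ai up 2-00:00:00 3 idle scs[2042,2044-2045] gpu:a100:8
"""


def usable_gpu_partitions(sinfo_text):
    """Return (partition, state, nodelist, gres) for idle/mix GPU rows."""
    rows = []
    for line in sinfo_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 7:
            continue  # ignore blank or malformed lines
        partition, _avail, _limit, _nodes, state, nodelist, gres = fields
        # Keep rows that advertise a GPU and are not drained, down or fully allocated.
        if gres.startswith("gpu:") and state in ("idle", "mix"):
            rows.append((partition.rstrip("*"), state, nodelist, gres))
    return rows


for row in usable_gpu_partitions(SAMPLE):
    print(row)
```

A mix node may still have free GPUs, which is why it is kept alongside idle nodes here.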
Start an interactive session#
First, use salloc to reserve resources.
salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1
Check your memberships here to find the project code to pass to --account: https://scw.bangor.ac.uk/en/projects/memberships/
As discussed earlier, we will use a node from accel_ai; Slurm will assign an idle one. I requested only one GPU (--gres=gpu:1).
[s.1915438@sl1 experiment]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1
salloc: Granted job allocation 7133017
salloc: Waiting for resource configuration
salloc: Nodes scs2042 are ready for job
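If you request allocations often, the salloc call can be assembled from its parts. The sketch below only builds and prints the command line and does not run it. The function name `salloc_command` and its parameters are hypothetical. The flags are the ones used in the transcript above.

```python
import shlex


def salloc_command(account, partition, gpus=1, nodes=1):
    """Build the salloc argument list used in this guide."""
    return [
        "salloc",
        f"--nodes={nodes}",
        f"--account={account}",
        f"--partition={partition}",
        f"--gres=gpu:{gpus}",  # number of GPUs to reserve on the node
    ]


# Print the command so it can be inspected or pasted into the login shell.
print(shlex.join(salloc_command("scw1901", "accel_ai")))
```

Changing `gpus` to 2 gives the request used later in the two-GPU section.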
You can now see your hardware allocation using squeue:
[s.1915438@sl1 experiment]$ squeue --user=s.1915438
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7133017 accel_ai bash s.191543 R 14:08 1 scs2042
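To read the allocated node out of the squeue listing programmatically, a small parser is enough. This is a rough sketch under the assumption that the output has the eight default columns shown above. The helper name `running_jobs` is hypothetical, and the sample text is copied from the transcript.

```python
# Sketch: extract running jobs (state "R") and their nodes from squeue text.

SQUEUE_SAMPLE = """\
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7133017 accel_ai bash s.191543 R 14:08 1 scs2042
"""


def running_jobs(squeue_text):
    """Return a list of dicts for jobs whose ST column is R (running)."""
    jobs = []
    for line in squeue_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)  # the last column may contain spaces
        if len(fields) == 8 and fields[4] == "R":
            jobs.append(
                {"jobid": fields[0], "partition": fields[1], "node": fields[7]}
            )
    return jobs


print(running_jobs(SQUEUE_SAMPLE))
```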
Loading Anaconda#
You can see a list of modules using module avail, and load Anaconda using module load anaconda/3. Alternatively, type module load ana and press TAB to complete the remaining characters.
Once Anaconda is loaded, the conda command is available.
Create a new Conda env to install PyTorch (otherwise skip this section)#
Now that conda is recognised, use conda create --name ml to create a new Conda environment named ml (any other name can be used).
Activate the Conda env#
First activate the base Conda env using source activate. Then type conda env list to see a list of Conda envs. Load the newly created Conda env using conda activate ml.
Install PyTorch#
Go here: https://pytorch.org/get-started/locally/
Get the command to install the stable PyTorch release with the latest CUDA version.
As of 15 March 2022, the latest stable release was 1.11.0.
Copy the command at the bottom of the selection table:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
Run this in the ml Conda env and wait. It takes time to install PyTorch with CUDA 11.3.
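Once the install finishes, a quick check of what was actually installed can save a debugging round later. The following is a minimal sketch, with a hypothetical helper name `torch_build_info`. It reports the PyTorch version and the CUDA version the build was compiled against, and returns None if PyTorch cannot be found in the active environment.

```python
import importlib.util


def torch_build_info():
    """Return version details for the installed PyTorch, or None if absent."""
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed in the active environment
    import torch

    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


info = torch_build_info()
print(info if info is not None else "PyTorch is not installed in this environment")
```

On a login node without a GPU, `cuda_available` is expected to be False even when `cuda_build` is set. The real test is running inside the allocation.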
Run a python file#
Write any PyTorch script. For example, I created gpu.py, which checks the availability of CUDA as well as the GPU name.
(ml) [s.1915438@sl1 experiment]$ cat gpu.py
import torch

print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")
# current_device() raises if PyTorch has no CUDA support or no GPU is visible
try:
    print(f"Current Devices: {torch.cuda.current_device()}")
except Exception:
    print('Current Devices: Torch is not compiled for GPU or No GPU')
print(f"No. of GPUs: {torch.cuda.device_count()}")
# Name of the first GPU (index 0)
try:
    print(f"GPU Name:{torch.cuda.get_device_name(0)}")
except Exception:
    print('GPU Name: No GPU available')
The easiest way to copy the script to the cluster is sftp. FileZilla is one sftp client you can use.
Open FileZilla and type sftp://sunbird.swansea.ac.uk into the host box. Enter your username (s.1915438) and password (uni password) in the username/password boxes, then transfer the Python script to a directory of your choice. Next time, you can go to the Server menu and click Reconnect to log in without re-entering your details.
In the Sunbird ssh session, go to the directory where you transferred gpu.py and run srun python gpu.py:
(ml) [s.1915438@sl1 experiment]$ srun python gpu.py
1.11.0
Is available: True
Current Devices: 0
No. of GPUs: 1
GPU Name:NVIDIA A100-PCIE-40GB
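Seeing the GPU name only shows that the device is visible. A follow-up check is to run a small computation on it. The sketch below is illustrative rather than part of the original walkthrough. The function name `matmul_device_check` is hypothetical, and it falls back to the CPU (or returns None when PyTorch is missing) so that it can also be tried outside an allocation.

```python
import importlib.util


def matmul_device_check(size=1024):
    """Multiply two random matrices on the GPU if one is available."""
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed in the active environment
    import torch

    # Use the first visible GPU when CUDA is available, otherwise the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.rand(size, size, device=device)
    b = torch.rand(size, size, device=device)
    c = a @ b
    return str(c.device), tuple(c.shape)


print(matmul_device_check())
```

Inside the allocation you would run it like gpu.py, with srun python followed by the file name you saved it under.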
It took me hours to understand and do everything written here. When you are done, just type exit to free the node.
Two GPUs#
Exiting the interactive session purges the conda module, so reload the anaconda/3 module and activate the ml Conda env again.
Allocate 2 GPUs:#
[s.1915438@sl1 experiment]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2
salloc: Granted job allocation 7133023
salloc: Waiting for resource configuration
salloc: Nodes scs2042 are ready for job
Modify the gpu.py file using nano#
(ml) [s.1915438@sl1 experiment]$ cat gpu.py
import torch

print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")
# current_device() raises if PyTorch has no CUDA support or no GPU is visible
try:
    print(f"Current Devices: {torch.cuda.current_device()}")
except Exception:
    print('Current Devices: Torch is not compiled for GPU or No GPU')
print(f"No. of GPUs: {torch.cuda.device_count()}")
# Name of the first GPU (index 0)
try:
    print(f"GPU Name:{torch.cuda.get_device_name(0)}")
except Exception:
    print('GPU Name: No GPU available')
# Name of the second GPU (index 1)
try:
    print(f"GPU Name:{torch.cuda.get_device_name(1)}")
except Exception:
    print('GPU Name: No GPU available')
Run the python script for 2 GPUs#
(ml) [s.1915438@sl1 experiment]$ srun python gpu.py
1.11.0
Is available: True
Current Devices: 0
No. of GPUs: 2
GPU Name:NVIDIA A100-PCIE-40GB
GPU Name:NVIDIA A100-PCIE-40GB
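Rather than adding one try/except per GPU index, the script can loop over however many devices the allocation exposes. This is a small sketch of that idea. The helper name `list_gpus` is hypothetical, and it returns an empty list when PyTorch is missing or no GPU is visible.

```python
import importlib.util


def list_gpus():
    """Return the names of all GPUs visible to PyTorch."""
    if importlib.util.find_spec("torch") is None:
        return []  # PyTorch is not installed in the active environment
    import torch

    # device_count() is 0 when no GPU is visible, so the loop is simply skipped.
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


for index, name in enumerate(list_gpus()):
    print(f"GPU {index}: {name}")
```

With this version the same file works unchanged for --gres=gpu:1, --gres=gpu:2 or larger requests.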